This is the first time I have opened RStudio, so I am feeling a bit confused. But I am sure that this environment will become more familiar during this course. I think that this course will be interesting since it seems to be a course where you learn by doing. It is also about two really contemporary subjects: open data and open science. What I expect from this course are at least the following things:
I found out about this course from Weboodi when I was planning what courses I should take this semester.
Here’s a link to my github:
https://github.com/mikkohyy/IODS-project
Here’s another way of creating a link with R Markdown.
library(GGally)
library(ggplot2)
setwd("data/")
learning_data <- read.csv(file="learning2014.csv")
str(learning_data)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning_data)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
This dataset consists of 7 variables and 166 rows, i.e. data about 166 students. The variables can be classified into three groups: 1) background information (gender, age, attitude towards statistics), 2) the learning style of the student (deep, strategic, surface), and 3) how well the student did on the exam (points). The data was collected from students of a statistics course.
ggpairs(learning_data,mapping=aes(col=gender,alpha=0.3),lower=list(combo=wrap("facethist",bins=20)),upper=list(continuous=wrap("cor",size=3)))
summary(learning_data)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The students are quite young (median = 22) and there are more females (110) than males (56). The learning styles of the students have more characteristics of “deep learning” (mean = 3.68) than “surface learning” (mean = 2.787). The strongest correlation is between attitude and points (0.437). Other interesting, but weaker, correlations are deep & attitude, stra & age, stra & points, and surf & points (negative).
Surf is interesting since it correlates negatively with several variables. But this is also quite self-evident since, for example, deep learning and surface learning are opposites. There are also some interesting correlations with gender - e.g. for males, age has a negative correlation with points.
The general impression is that attitude correlates strongly with points, and that the surface learning style correlates negatively with various variables.
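These pairwise correlations, which ggpairs prints in its upper panels, can be reproduced directly with cor(). A minimal sketch with made-up values standing in for two numeric columns of learning_data:

```r
# Toy stand-in for two numeric columns of learning_data (values are illustrative)
toy <- data.frame(
  attitude = c(3.7, 3.1, 2.5, 3.5, 3.7, 2.1),
  points   = c(25, 12, 24, 10, 22, 26)
)
cor_matrix <- round(cor(toy), 3)   # same statistic ggpairs shows in its upper panels
cor_matrix["attitude", "points"]
```

On the real data, cor() over all numeric columns gives the full matrix behind the ggpairs figure.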
I build my model on the assumption that “a good attitude”, deep learning, and avoiding surface learning will lead to good results on the exam. Seems reasonable, right? Hence, I will use attitude, deep and surf in my model:
model <- lm(points~attitude + deep + surf, data=learning_data)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + deep + surf, data = learning_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9168 -3.1487 0.3667 3.8326 11.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3551 4.7124 3.895 0.000143 ***
## attitude 3.4661 0.5766 6.011 1.18e-08 ***
## deep -0.9485 0.7903 -1.200 0.231815
## surf -1.0911 0.8360 -1.305 0.193669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1876
## F-statistic: 13.7 on 3 and 162 DF, p-value: 5.217e-08
From the summary of the model we can see that only attitude can be considered statistically highly significant (p &lt; 0.001, or ***) in this case. In other words, if attitude actually had no effect on points, it would be very unlikely to see an estimate this far from zero. Hence, we can conclude that there is a statistical relationship between attitude and points.
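The p-values in that printout can also be extracted programmatically from the coefficient table. A sketch on simulated data (the numbers below are made up, not the course data):

```r
# Simulate data with roughly the structure of points ~ attitude
set.seed(1)
d <- data.frame(attitude = runif(50, 1, 5))
d$points <- 10 + 3.5 * d$attitude + rnorm(50, sd = 5)

m <- lm(points ~ attitude, data = d)
coef_table <- summary(m)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
p_attitude <- coef_table["attitude", "Pr(>|t|)"]
p_attitude
```

The same indexing on the real model (`summary(model)$coefficients`) pulls out the p-values shown above.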
Since deep and surf were not statistically significant, we change our model. The new model is simple and has only attitude as the explanatory variable:
new_model <- lm(points~attitude, data=learning_data)
summary(new_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
What this model tells us is that a high value on attitude seems to increase the points a student gets from the exam. In other words, every additional point on attitude increases the exam points by an average of 3.53. This is quite logical, since scoring high on attitude presumably implies that a student is motivated.
The multiple R-squared describes how well “the model fits the set of observations” (this was the definition in the blog). In this case it tells how much of the variation in points is explained by attitude alone. This model’s multiple R-squared is 0.1906, which means that roughly 19% of the variation is explained by attitude.
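The R-squared figure can be checked by hand: it is one minus the ratio of the residual sum of squares to the total sum of squares. A sketch on simulated data (not the course data):

```r
set.seed(2)
d <- data.frame(attitude = runif(40, 1, 5))
d$points <- 10 + 3.5 * d$attitude + rnorm(40, sd = 5)
m <- lm(points ~ attitude, data = d)

sse <- sum(residuals(m)^2)                 # unexplained variation
sst <- sum((d$points - mean(d$points))^2)  # total variation
r2  <- 1 - sse / sst
all.equal(r2, summary(m)$r.squared)        # same number the summary prints
```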
Regression models rest on certain assumptions. Naturally, one of them is linearity. Another is that the errors are normally distributed. It is possible to assess the validity of these assumptions by analyzing the residuals of the model. This is done by asking whether the following assumptions hold: 1) the errors of the model are normally distributed, 2) the size of the errors does not depend on the explanatory variables (i.e. the errors have constant variance), and 3) the errors are not correlated.
par(mfrow=c(2,2))
plot(model,which=c(2,1,5))
Residuals vs Fitted values
The Residuals vs Fitted plot makes it possible to explore whether there are problems with the constant variance assumption. Since there does not seem to be any pattern in the scatter plot, this assumption appears to hold.
Normal QQ-plot
The normal Q-Q plot makes it possible to explore whether there are problems with the assumption of normally distributed errors. The dots fit the line relatively well, although there are some anomalies at the beginning and the end. Still, I would conclude that there are no problems with this assumption.
Residuals vs Leverage
Residuals vs Leverage makes it possible to explore whether some observations have a high impact on the model - e.g. a value with very high leverage makes the model less reliable. In my model, no single value has a high leverage, so nothing “pulls” the fit in its own direction.
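Leverage can also be inspected numerically with hatvalues(); a sketch on simulated data where one observation is deliberately given an extreme x-value:

```r
set.seed(3)
d <- data.frame(x = c(rnorm(30), 10))  # the 31st point sits far from the rest
d$y <- rnorm(31)
m <- lm(y ~ x, data = d)

h <- hatvalues(m)   # leverage of each observation
which.max(h)        # the extreme-x observation has by far the highest leverage
```

For a model with an intercept and one predictor, the leverages always sum to 2 (the number of estimated coefficients).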
library(ggplot2)
library(dplyr)
library(boot)
setwd("data/")
alc_data <- read.csv("part3_alc_data.csv")
names(alc_data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alc_data)
## [1] 382 35
The data is about students in “secondary education of two Portuguese schools”. In this data set there are 382 students and 35 variables. The variables describe various attributes, such as information about a student’s home, their after-school activities, and how well they do in school. The data set was collected “using school reports and questionnaires”. The ages of the students are mostly between 15 and 18 (although a few are slightly older).
The data used in this exercise has been combined from two data sets (the first about performance in math, the second in Portuguese language). The students who appeared in both data sets were selected for this one. Numeric variables were combined by taking the average over the two data sets - i.e. the grades here are the average of the math and Portuguese grades. Hence, they can be claimed to indicate how well the student did in general.
(As a side note, the data I am using (which I wrangled myself) differs from the data linked in the analysis assessment. The difference is in the grades: in the ready-wrangled data the grades seem to be the math grades, whereas I understood they should be the average of math and por. So my results may differ from those based on the ready-wrangled data.)
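The combination step described above can be sketched with base R's merge(); the data frames, the key, and the column names here are hypothetical stand-ins for the real wrangling script:

```r
# Hypothetical mini versions of the math and Portuguese grade tables
math <- data.frame(id = c(1, 2, 3), G3 = c(10, 14, 8))
por  <- data.frame(id = c(2, 3, 4), G3 = c(12, 10, 16))

# Keep only students present in both sets, then average the two grades
joined <- merge(math, por, by = "id", suffixes = c("_mat", "_por"))
joined$G3 <- (joined$G3_mat + joined$G3_por) / 2
joined$G3   # averages for the students found in both tables
```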
The variables I chose were:
summary(select(alc_data, G3, goout, studytime, internet))
## G3 goout studytime internet
## Min. : 0.00 Min. :1.000 Min. :1.000 no : 58
## 1st Qu.:10.00 1st Qu.:2.000 1st Qu.:1.000 yes:324
## Median :12.00 Median :3.000 Median :2.000
## Mean :11.46 Mean :3.113 Mean :2.037
## 3rd Qu.:14.00 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :18.00 Max. :5.000 Max. :4.000
It seems that the students are quite outgoing (goout mean 3.113) and most of them have internet access at home. The grades are skewed towards the higher end, and the students do not seem too excited about spending their time studying (studytime mean is 2.037).
Box and bar plots
par(mar=c(4,4,0.2,0.1))
g.g3 <- ggplot(data=alc_data,aes(x=high_use,y=G3,fill=high_use))
g.g3 + geom_boxplot() + ylab("Final grade")
g.going_out <- ggplot(data=alc_data,aes(x=goout,fill=high_use))
g.going_out + geom_bar() + ylab("Going out")
g.studytime <- ggplot(data=alc_data,aes(x=studytime,fill=high_use))
g.studytime + geom_bar()
g.internet <- ggplot(data=alc_data,aes(x=internet,fill=high_use))
g.internet + geom_bar()
There are at least some differences in grades in relation to high_use - e.g. the median is a bit higher. From the going-out barplot we can see that when goout &gt; 3, the high_use group becomes much bigger than the non-high_use group. For internet, studytime and goout we get more insight with cross-tables:
t.internet <- table(alc_data$internet,alc_data$high_use,dnn=c("internet","high_alc"))
addmargins(round(prop.table(t.internet)*100,1))
## high_alc
## internet FALSE TRUE Sum
## no 11.3 3.9 15.2
## yes 58.9 25.9 84.8
## Sum 70.2 29.8 100.0
t.studytime <- table(alc_data$studytime,alc_data$high_use,dnn=c("studytime","high_alc"))
addmargins(round(prop.table(t.studytime)*100,1))
## high_alc
## studytime FALSE TRUE Sum
## 1 15.2 11.0 26.2
## 2 35.3 15.7 51.0
## 3 13.6 2.1 15.7
## 4 6.0 1.0 7.0
## Sum 70.1 29.8 99.9
t.goout <- table(alc_data$goout,alc_data$high_use,dnn=c("goout","high_alc"))
addmargins(round(prop.table(t.goout)*100,1))
## high_alc
## goout FALSE TRUE Sum
## 1 5.0 0.8 5.8
## 2 22.0 4.2 26.2
## 3 27.0 6.0 33.0
## 4 10.7 10.5 21.2
## 5 5.5 8.4 13.9
## Sum 70.2 29.9 100.1
Compared to my hypotheses, these explorations lead to the following conclusions:
the_model <- glm(high_use~G3+goout+studytime+internet,data=alc_data, family="binomial")
summary(the_model)
##
## Call:
## glm(formula = high_use ~ G3 + goout + studytime + internet, family = "binomial",
## data = alc_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7427 -0.7849 -0.5479 0.9523 2.5187
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.80189 0.69364 -2.598 0.009385 **
## G3 -0.03626 0.03875 -0.936 0.349408
## goout 0.72858 0.11802 6.173 6.68e-10 ***
## studytime -0.57985 0.16616 -3.490 0.000483 ***
## internetyes 0.11896 0.35473 0.335 0.737368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 399.89 on 377 degrees of freedom
## AIC: 409.89
##
## Number of Fisher Scoring iterations: 4
What I find surprising is that G3 (the final grade) is not significant in relation to high_use, while goout and studytime are significant. And as I thought, having internet at home does not have any significance. From the model we can say that going out has a positive connection with high_use (i.e. an increase in goout increases the odds of high_use being true) and studytime a negative one. G3 seems to have a really small negative connection and having internet at home a positive one (i.e. having internet increases the odds of high alcohol use) - but neither of them is significant.
The difference between the null deviance (465.68) and the residual deviance (399.89) basically describes how good a fit the model is. For this model the difference is 65.79 (on 4 degrees of freedom), which indicates that this model is not bad.
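That deviance drop can be tested formally: under the null hypothesis the difference approximately follows a chi-square distribution with degrees of freedom equal to the number of added predictors. Plugging in the figures from the summary above:

```r
# Deviance figures copied from the model summary
null_dev  <- 465.68   # null deviance, 381 df
resid_dev <- 399.89   # residual deviance, 377 df
df_diff   <- 381 - 377

# p-value for "the four predictors add nothing over the intercept-only model"
p <- pchisq(null_dev - resid_dev, df = df_diff, lower.tail = FALSE)
p
```

The tiny p-value supports the "not bad" verdict: the predictors clearly improve on the null model.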
Odds ratios and confidence intervals:
oddr <- coef(the_model) %>% exp
confinterv <- confint(the_model) %>% exp
cbind(oddr,confinterv)
## oddr 2.5 % 97.5 %
## (Intercept) 0.1649873 0.0409081 0.6261624
## G3 0.9643898 0.8936694 1.0407265
## goout 2.0721458 1.6539622 2.6296307
## studytime 0.5599823 0.3999663 0.7686348
## internetyes 1.1263211 0.5709167 2.3115264
Looking at the oddr column, we see how the variables affect the odds of high_use (here &gt; 1 indicates a positive relationship). Goout has a relatively large positive relationship: when goout increases by 1, the odds of high_use are roughly doubled (odds ratio ~2.07). The confidence intervals suggest we can be fairly sure that goout has a positive connection and studytime a negative one. But, for example, internet at home is quite uninformative, since its confidence interval is wide and includes 1.
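The interpretation "one unit of goout roughly doubles the odds" follows from the model's structure: exponentiating a logistic-regression coefficient gives the multiplicative change in odds per unit of the predictor. A sketch on simulated data (not the course data):

```r
set.seed(4)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 0.7 * x))
m <- glm(y ~ x, family = "binomial")

# Odds at a given x are exp(log-odds); the ratio between x+1 and x is exp(coef)
odds_at <- function(x0) exp(predict(m, data.frame(x = x0)))
all.equal(unname(odds_at(2) / odds_at(1)), unname(exp(coef(m)["x"])))
```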
My hypothesis about internet proved quite wrong if we look at the model (it has a positive connection). Another surprising thing is that G3 has hardly any connection (a coefficient of only about -0.04) with alcohol use. Therefore, my hypothesis based on traditional thinking was false. Goout and studytime played the roles I thought they would. The next question is how well my model predicts.
Graph:
better_model <- glm(high_use~goout+studytime,data=alc_data, family="binomial")
probs <- predict(better_model, type="response")
alc_data <- mutate(alc_data,probability=probs)
alc_data <- mutate(alc_data, prediction=probability>0.5)
plot <- ggplot(alc_data,aes(x=probability,y=high_use,col=prediction))
plot + geom_point()
Crosstable:
pred.table <- table(high_use=alc_data$high_use,prediction=alc_data$prediction)
addmargins(prop.table(pred.table))
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64659686 0.05497382 0.70157068
## TRUE 0.19109948 0.10732984 0.29842932
## Sum 0.83769634 0.16230366 1.00000000
Penalty:
lossf <- function(class,prob) {
wrong <- abs(class-prob) > 0.5
mean(wrong)
}
lossf(class=alc_data$high_use,prob=alc_data$probability)
## [1] 0.2460733
This model’s penalty is approximately 0.25, which means it predicted 75% of the cases right. It seems that my model has difficulties predicting the high_use cases: it predicted 19% of all cases wrong by claiming high_use was FALSE when it was actually TRUE.
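The penalty ties directly to the cross-table above: the diagonal holds the correctly classified proportions, so one minus their sum is the loss. Using the proportions copied from the table:

```r
# Proportions from the prediction cross-table (rows: high_use, cols: prediction)
pred_table <- matrix(c(0.64659686, 0.19109948,
                       0.05497382, 0.10732984), nrow = 2)
accuracy <- sum(diag(pred_table))   # correctly classified share
round(1 - accuracy, 7)              # 0.2460733, the same penalty lossf gave
```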
I think that random guessing would be a straightforward and simple guessing strategy:
guess <- sample(c(TRUE,FALSE),size=382,replace=TRUE)
wrong <- alc_data$high_use != guess  # != gives the wrong guesses (== would count the correct ones)
mean(wrong)
## [1] 0.4790576
I ran this a few times and the penalty was always somewhere between 0.45 and 0.55. Hence, my model predicts better than just guessing randomly.
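A single random-guess run is noisy, so repeating it many times gives a steadier picture of the baseline; a sketch with a simulated high_use vector standing in for the real one:

```r
set.seed(5)
high_use <- rbinom(382, 1, 0.3) == 1   # stand-in for alc_data$high_use (~30% TRUE)

penalties <- replicate(1000, {
  guess <- sample(c(TRUE, FALSE), size = 382, replace = TRUE)
  mean(guess != high_use)              # share of wrong guesses in this run
})
round(mean(penalties), 2)              # settles near 0.5
```

This makes the comparison concrete: a fair-coin guesser is wrong about half the time regardless of the class balance.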
crossv <- cv.glm(alc_data,cost=lossf,glmfit=better_model,K=10)
crossv$delta[1]
## [1] 0.2460733
According to cross-validation, my model seems to be a bit better than the one in Data Camp (an error of ~0.25 is slightly better than ~0.26 - but only slightly).
At the beginning there is a model with 10 variables. After each “round” one variable is removed, until we reach the same model (goout and studytime) that was built in the earlier sections. Below is the code for the first round; for the second round I removed traveltime, and continued like this until only two variables remained in the model.
massive_model <- glm(high_use~goout+studytime+Fedu+Fjob+Medu+Mjob+freetime+romantic+sex+traveltime,data=alc_data, family="binomial")
crossv <- cv.glm(alc_data,cost=lossf,glmfit=massive_model,K=10)
penalty <- lossf(class=alc_data$high_use,prob=alc_data$probability)
results <- matrix(nrow=9,ncol=3)
results[1,] <- c(10,penalty,crossv$delta[1])
# Matrix of each round:
results
## [,1] [,2] [,3]
## [1,] 10 0.2094241 0.2329843
## [2,] 9 0.2146597 0.2303665
## [3,] 8 0.2251309 0.2382199
## [4,] 7 0.2277487 0.2513089
## [5,] 6 0.2146597 0.2408377
## [6,] 5 0.2277487 0.2356021
## [7,] 4 0.2146597 0.2356021
## [8,] 3 0.2460733 0.2356021
## [9,] 2 0.2460733 0.2460733
results.df <- as.data.frame(results)
ggplot(data=results.df,aes(x=V1)) + geom_line(aes(y=V2,colour="V2")) + geom_point(aes(y=V2,colour="V2")) + geom_line(aes(y=V3,colour="V3")) + geom_point(aes(y=V3,colour="V3")) + scale_x_continuous(trans = "reverse", breaks = unique(results.df$V1)) + ylab("Error") + xlab("Number of predictors")
Red line (V2) describes the value that was produced by the loss function. Blue line (V3) is the value that was produced by cv.glm.
library(MASS)
library(GGally)
library(ggplot2)
library(tidyverse)
library(corrplot)
data(Boston)
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The Boston dataset comes from the MASS package (a supplementary package to a book on applied statistics). It has 506 observations and 14 variables. Its content is described as “Housing Values in Suburbs of Boston”. The variables are quite heterogeneous; some examples are the crime rate by town, nitrogen oxides concentration, and the pupil-teacher ratio by town. All variables are numerical (although chas is a dummy variable).
First, let’s look at the data:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
These summaries tell us that the scales of the variables are quite different from each other: there are variables with a max value of 8.780 (rm) and others with a max value of 711.0 (tax). Some of the variables seem to have a somewhat odd structure - e.g. the crime rate’s (crim) median is 0.25 but its maximum is ~88.98.
Then a plot that is made with ggpairs (I adjusted the theme a bit to make it more readable):
ggpairs(Boston,upper=list(continuous=wrap("cor",size=2.2)), lower=list(continuous=wrap("points",size=0.5))) + theme(axis.text=element_text(size=5),panel.grid.major=element_blank(),axis.ticks=element_blank(),panel.border=element_rect(linetype="solid", colour="black",fill=NA),strip.text=element_text(size=7))
Only a few variables are somewhat normally distributed (e.g. rm, the average number of rooms per dwelling). Most of the variables are skewed and/or bimodal. From the scatter plots we can find at least one clear (negative) correlation: between the median value of homes (medv) and the lower-status percentage (lstat). There are also variables whose values mostly sit at the “lower” end of the scale (e.g. crim &amp; chas). The correlations are probably much easier to read from a correlation matrix:
cor_mtx <- round(cor(Boston),2)
corrplot(round(cor(Boston),2),type="upper",tl.cex=0.7)
There are some negative correlations, especially involving dis (distance from Boston employment centres) - one of them is with age. Tax has a strong positive correlation with various variables - e.g. rad (accessibility to radial highways). But at this point it is difficult to say much about the data, since there do not seem to be any clear patterns.
Since the scales of the variables were so all over the place (and that is bad for clustering), we need to scale the data:
boston_sc <- scale(Boston)
boston_sc <- as.data.frame(boston_sc) # scale() outputs a matrix, so transform it into a data frame
summary(boston_sc)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Scaling changed the data so that the mean of every variable is 0. It also made the variables resemble each other more - for example tax’s scale was from 187.0 to 711.0, and after scaling it runs from -1.3127 to 1.7964. This step makes sense if we are interested in the distances between the cases described in this dataset.
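What scale() does per column is simply (x - mean(x)) / sd(x); a quick check on a small tax-like vector:

```r
x <- c(187, 279, 330, 666, 711)         # tax-like values
manual <- (x - mean(x)) / sd(x)          # standardization written out by hand
all.equal(as.vector(scale(x)), manual)   # identical to what scale() produces
```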
Then we create a categorical variable called crime, based on the quantiles of the crim variable:
brks <- quantile(boston_sc$crim)
lbls <- c("low","med_low","med_high","high")
crime <- cut(boston_sc$crim,breaks=brks,label=lbls,include.lowest=TRUE)
boston_sc$crime <- crime
boston_sc <- boston_sc[,-1] # Remove the old crim variable
summary(boston_sc$crime)
## low med_low med_high high
## 127 126 126 127
Then let’s divide the data into training (80%) and testing (20%) sets:
n <- nrow(boston_sc)
ind <- sample(n, size=n*0.8)
train_set <- boston_sc[ind,]
test_set <- boston_sc[-ind,]
Now we have two sets: train_set contains a randomly chosen 80% of the cases of the Boston data, and test_set the remaining 20%.
First we do a linear discriminant analysis on the training set (train_set). The categorical crime variable is the target variable and the rest of the variables are used as predictors:
boston_lda_train <- lda(crime~., data=train_set)
boston_lda_train
## Call:
## lda(crime ~ ., data = train_set)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2648515 0.2450495 0.2425743 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 0.94573776 -0.8918444 -0.08835242 -0.8766634 0.4846250
## med_low -0.08727013 -0.2984225 -0.11325431 -0.5865564 -0.1520837
## med_high -0.37173443 0.1173494 0.12941585 0.3621768 0.2057588
## high -0.48724019 1.0149946 -0.07547406 1.0218098 -0.4083083
## age dis rad tax ptratio
## low -0.8983330 0.8604052 -0.6888511 -0.7394791 -0.49705377
## med_low -0.3186936 0.4166193 -0.5468458 -0.4808148 -0.07044209
## med_high 0.4281029 -0.3567054 -0.4369353 -0.3377952 -0.29855251
## high 0.8391849 -0.8501178 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.37425395 -0.7702561 0.57473366
## med_low 0.32283348 -0.1183556 -0.00400634
## med_high 0.07692219 -0.0212199 0.24063140
## high -0.79917564 0.9368687 -0.69813348
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11009175 0.600697615 -0.96496793
## indus 0.03668967 -0.085856934 0.35958484
## chas -0.11094361 -0.007324431 0.02639561
## nox 0.18901740 -0.905592433 -1.27330483
## rm -0.21702879 -0.134490047 -0.24265253
## age 0.24810857 -0.449537145 0.06485414
## dis -0.07973223 -0.257889566 0.35563520
## rad 4.01800167 1.004616351 -0.08563280
## tax 0.18397484 -0.057174140 0.46276973
## ptratio 0.09969602 -0.009830561 -0.29405328
## black -0.13891441 0.008532644 0.12138499
## lstat 0.14667070 -0.245824088 0.28296524
## medv 0.19384738 -0.381586529 -0.16122736
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9640 0.0269 0.0091
Then a (bi)plot of the LDA, with colors based on the crime rate (for some reason pch=crime_as_n did not work in this plot):
crime_as_n <- as.numeric(train_set$crime)
plot(boston_lda_train,dimen=2,col=crime_as_n)
The end result looks a bit different from the plot in the Data Camp exercise (I do not know whether this is a bad thing, and if it is, how bad it is).
Then we test how well the model works on data it has not seen before (test_set). The first thing to do is remove the “right answers” from test_set (crime is column number 14):
correct_from_test <- test_set[,14]
test_set <- test_set[,-14]
Then we will use the lda model to classify the cases (their crime rates) from the test_set and illustrate the results with a cross-table:
boston_pred <- predict(boston_lda_train,newdata=test_set)
addmargins(table(correct=correct_from_test,predicted=boston_pred$class))
## predicted
## correct low med_low med_high high Sum
## low 14 6 0 0 20
## med_low 8 15 4 0 27
## med_high 1 7 17 3 28
## high 0 0 1 26 27
## Sum 23 28 22 29 102
It seems that the model was quite good at classifying the cases from test_set (72/102 were classified right). Although, in the case of med_low, it only managed to get 15/27 right. Overall, the model placed a case in the right “basket of crime rate” in about 71% of the cases, which I guess can be considered a relatively good success rate.
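The success rate comes straight from the cross-table's diagonal; plugging in the counts from the table above:

```r
# Counts from the cross-table (rows: correct class, cols: predicted class)
conf <- matrix(c(14,  8,  1,  0,    # predicted low
                  6, 15,  7,  0,    # predicted med_low
                  0,  4, 17,  1,    # predicted med_high
                  0,  0,  3, 26),   # predicted high
               nrow = 4)
sum(diag(conf))              # 72 correctly classified
sum(diag(conf)) / sum(conf)  # about 0.71
```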
Re-loading and standardizing the Boston dataset:
data(Boston)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)
First we calculate the distances between the observations, using the most basic method, Euclidean distance:
boston_euc_dis <- dist(boston_again, method="euclidean")
summary(boston_euc_dis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
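dist() with method="euclidean" computes sqrt(sum((a - b)^2)) for each pair of rows; a tiny check with a 3-4-5 triangle:

```r
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
d_euc <- dist(m, method = "euclidean")
as.numeric(d_euc)                  # 5
sqrt(sum((m[1, ] - m[2, ])^2))     # 5, the same formula written out
```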
To evaluate the best number of clusters I will use the function introduced in the Data Camp exercise (the maximum number of clusters in this evaluation is 10):
set.seed(22)
cluster_n <- 10
wcss <- sapply(1:cluster_n, function(k){kmeans(boston_again,k)$tot.withinss})
qplot(x=1:cluster_n,y=wcss,geom="line")
I think the best number of clusters is 3, since the total within-cluster sum of squares does not decrease much after that point. Another possible number would be 5, but when used in a visualization the result is quite messy. Hence, I will go with the classic approach of “less is more”. Next is the visualization of the clusters when their number is 3:
boston_km <- kmeans(boston_again,centers=3)
pairs(boston_again,col=boston_km$cluster,cex=0.2)
It seems that the “black cluster” jumps out in each plot (though this may simply be because its color is much more visible in the small graphs). Three variables seem to have (in most cases) clear clusters in them: black, crim and lstat. My interpretation is that these variables have an important role in the model and the classification - i.e. the clusters are heavily influenced by them. The cluster colored black also seems to have “a strong identity”, since it is clearly visible in most of the plots and thus clearly different from the other clusters.
I will perform k-means with four clusters (the arrow function I use is the same as in the Data Camp exercise):
data(Boston)
set.seed(22)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)
boston_km <- kmeans(boston_again,centers=4)
boston_again$cluster <- boston_km$cluster
boston_lda <- lda(cluster~., data=boston_again)
plot(boston_lda, dimen=2, col=boston_again$cluster)
# The lda.arrows helper from the Data Camp exercise, reproduced here so the chunk is self-contained
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, x1 = myscale * heads[,choices[1]], y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), cex = tex, col = color, pos = 3)
}
lda.arrows(boston_lda, myscale=2)
It seems that the most influential individual variable is black. Other variables with a relatively strong influence are crim, indus, nox and tax. The variables black and crim seem to “pull” in their own directions, while most of the variables form a group that “pulls” to the left. The fact that black is influential confirms earlier observations (e.g. part 7 of this exercise).
model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(boston_lda_train$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% boston_lda_train$scaling
matrix_product <- as.data.frame(matrix_product)
# install.packages("plotly")
library(plotly)
plot_ly(x=matrix_product$LD1, y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=crime_as_n,size=I(40))
plot_ly(x=matrix_product$LD1, y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=cluster_col,size=I(40))
The points are scattered quite similarly in these two plots. Based on them, I would say that the crime variable plays a relatively large role when the algorithm defines the clusters. The two plots illustrate this, since the cluster colors are distributed much like the crime-rate colors.
Setting up:
library(dplyr)
library(ggplot2)
library(GGally)
library(FactoMineR)
library(tidyr)
setwd("data/")
human_data <- read.csv("human.csv",row.names=1)
summary(human_data)
## edu_ratio labr_ratio exp_edu_yrs life_exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni mat_mor_rat adol_brate parl_rep_pct
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The scales and units of the variables differ considerably from each other - e.g. edu_ratio ranges from 0.1717 to 1.4967 while gni ranges from 581 to 123124. Gni is also an interesting variable in itself, since its third quartile is 24512 but its maximum is 123124.
ggpairs(human_data, lower=list(continuous=wrap("points",size=0.8))) + theme(axis.text.y=element_text(size=7), axis.text.x=element_text(angle=45, hjust=1, size=7),panel.grid.major=element_blank(),axis.ticks=element_blank(),panel.border=element_rect(linetype="solid", colour="black",fill=NA),strip.text=element_text(size=6))
Some of the variables (e.g. gni and mat_mor_rat) are clearly skewed to the right. Other variables (e.g. labr_ratio and life_exp) are skewed to the left, though not as strongly as the right-skewed ones.
There are quite strong correlations between some variables. For example, life_exp and exp_edu_yrs have a strong positive correlation (0.789). Adol_brate and mat_mor_rat also correlate strongly with several variables - in the case of mat_mor_rat the correlations are mostly negative. Not surprisingly, correlations are strong between variables that are “logically” connected to each other - e.g. life_exp has a quite strong positive correlation with gni and a strong negative correlation with mat_mor_rat.
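The same strongest-correlation check can also be done programmatically from the correlation matrix instead of reading it off the ggpairs plot. A sketch, illustrated on the built-in swiss dataset as a stand-in for human.csv:

```r
# Correlation matrix of a numeric data frame
cm <- cor(swiss)

# Keep each variable pair only once by blanking the upper triangle
cm[upper.tri(cm, diag = TRUE)] <- NA

# Flatten to (Var1, Var2, correlation) rows and drop the blanks
pairs_df <- na.omit(as.data.frame(as.table(cm)))

# The three strongest correlations by absolute value
pairs_df[order(-abs(pairs_df$Freq)), ][1:3, ]
```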
Principal component analysis:
pca_human_data <- prcomp(human_data)
The variability captured by the principal components:
temp <- summary(pca_human_data)
pca_human_pct <- round(100*temp$importance[2,],2)
pca_human_pct
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 99.99 0.01 0.00 0.00 0.00 0.00 0.00 0.00
Principal component 1 captures almost all (99.99%) of the variance.
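The proportions in `importance[2,]` are simply each component’s squared standard deviation divided by the total variance, so they can be reproduced by hand. A sketch on the built-in swiss dataset (standardized first, as a stand-in for the data used here):

```r
pca <- prcomp(scale(swiss))

# Proportion of variance per component, computed manually
manual <- pca$sdev^2 / sum(pca$sdev^2)

# The same proportions as stored by summary()
stored <- summary(pca)$importance[2, ]

round(manual, 4)
```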
labels <- paste0(names(pca_human_pct), " (", pca_human_pct, "%)")
The biplot:
biplot(pca_human_data, cex=c(0.6,0.7), col=c("gray47","red"), xlab=labels[1], ylab=labels[2])
Figure 1: A biplot of the non-standardized data, where PC1 explains 99.99% of the variation. In this analysis Gross National Income seems to be the most important variable, and the data is essentially reduced to the differences that the countries have in Gross National Income.
Standardizing the data:
human_data_st <- scale(human_data)
The variability captured by the principal components in the standardized data:
pca_human_data_st <- prcomp(human_data_st)
temp_st <- summary(pca_human_data_st)
pca_human_pct_st <- round(100*temp_st$importance[2,],1)
pca_human_pct_st
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
Creating axis labels for the biplot:
labels_st <- paste0(names(pca_human_pct_st), " (", pca_human_pct_st, "%)")
Creating biplot with the standardized data:
biplot(pca_human_data_st, cex=c(0.6,0.7), col=c("gray47","red"), xlab=labels_st[1], ylab=labels_st[2])
Figure 2: A biplot of the standardized data. The plot shows three groups of variables. Labour ratio and the percentage of female representatives in parliament contribute to PC2, and the rest of the variables to PC1. Together these components capture 69.8% of the variance in the data.
The plots are very different from each other. The first figure shows that PC1 captures 99.99% of the variation. The most important variable in building this component seems to have been Gross National Income, and the component is essentially based on that variable alone. We could probably use only one axis (or even the single variable gni) to represent the non-standardized data without losing much information about the dataset.
The first two components of the PCA on the standardized data (PC1: 53.6%, PC2: 16.2%) together capture 69.8% of the variation in the data. From Figure 2 we can identify three groups of variables that have a relatively strong relation to the components: labr_ratio and parl_rep_pct contribute to PC2 and the other variables to PC1.
In the non-standardized dataset the variables differed greatly from each other in unit and scale - e.g. the already mentioned difference between gni and edu_ratio. Hence, in the first PCA far too much weight was given to gni.
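Standardizing before PCA can also be done directly with prcomp’s `scale.` argument instead of calling scale() separately; the two approaches give the same components. A sketch on the built-in swiss dataset:

```r
# Standardize first, then run PCA
pca_manual <- prcomp(scale(swiss))

# Let prcomp do the standardization itself
pca_arg <- prcomp(swiss, scale. = TRUE)

# The loadings agree (component signs are arbitrary, hence abs())
all.equal(abs(pca_manual$rotation), abs(pca_arg$rotation))
```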
In the standardized data, PC1 (which accounts for 53.6% of the variance) seems to consist of two groups of variables. The first group contains Expected years of schooling, Life expectancy at birth, Gross National Income per capita, and the ratio of female to male populations with secondary education in each country. The second group consists of Maternal mortality ratio and Adolescent birth rate. These groups correlate negatively with each other. PC1 thus resembles the division of countries into developed and developing ones - i.e. long education and high life expectancy imply that a country is developed, while a high maternal mortality ratio implies that it is developing.
The second component seems to consist of variables that measure gender equality, since its elements are Labour ratio and the percentage of females in parliament. A surprising detail is that Education ratio is more connected to PC1.
In summary, we have two components: one that shows variation in the degree of “development” and another that shows how equal a country is.
The tea dataset looks like this:
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
Visualizing the variables in three parts (to improve readability):
gather(tea[1:12]) %>% ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar() + theme(axis.text.x=element_text(angle=45, hjust= 1, size=8))
gather(tea[13:24]) %>% ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar() + theme(axis.text.x=element_text(angle=45, hjust=1, size=8))
gather(tea[25:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar() + theme(axis.text.x=element_text(angle=45, hjust=1, size=8))
From the documentation we find out that the tea dataset consists of 300 answers to a “questionnaire on tea”. There were three groups of questions in it: 1) how the person consumes tea (18 questions), 2) “what are their product’s perception” (12 questions), and 3) personal details (4 questions). In addition, there are two extra variables in the data. In the following Multiple Correspondence Analysis I will use the variables about the “perception of tea” (columns 25-36) together with two variables that tell us about the respondents: sex and age_Q.
First I will create a new dataset with the variables we want to use:
tea_set <- tea[c(20,23,25:36)]
Then MCA-analysis and visualization:
tea_mca <- MCA(tea_set, graph=FALSE)
summary(tea_mca)
##
## Call:
## MCA(X = tea_set, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.137 0.113 0.098 0.095 0.092 0.085
## % of var. 11.279 9.276 8.056 7.804 7.561 7.035
## Cumulative % of var. 11.279 20.554 28.610 36.414 43.975 51.010
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.076 0.071 0.066 0.062 0.059 0.054
## % of var. 6.225 5.814 5.439 5.120 4.899 4.443
## Cumulative % of var. 57.235 63.048 68.487 73.608 78.507 82.950
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17
## Variance 0.049 0.047 0.040 0.038 0.034
## % of var. 4.005 3.883 3.297 3.101 2.765
## Cumulative % of var. 86.954 90.837 94.134 97.235 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2
## 1 | 0.649 1.025 0.275 | -0.331 0.323 0.071 |
## 2 | 0.199 0.096 0.030 | 0.374 0.415 0.106 |
## 3 | -0.015 0.001 0.000 | -0.244 0.175 0.067 |
## 4 | 0.307 0.229 0.082 | -0.419 0.520 0.154 |
## 5 | 0.269 0.176 0.059 | 0.038 0.004 0.001 |
## 6 | 0.599 0.873 0.311 | -0.617 1.125 0.330 |
## 7 | 0.592 0.854 0.241 | -0.581 0.998 0.232 |
## 8 | -0.251 0.153 0.062 | -0.516 0.786 0.264 |
## 9 | 0.605 0.890 0.254 | -0.376 0.419 0.098 |
## 10 | 0.241 0.141 0.051 | -0.440 0.574 0.172 |
## Dim.3 ctr cos2
## 1 -0.641 1.402 0.269 |
## 2 -0.586 1.170 0.259 |
## 3 -0.582 1.154 0.381 |
## 4 0.357 0.435 0.112 |
## 5 -0.022 0.002 0.000 |
## 6 -0.240 0.197 0.050 |
## 7 -0.458 0.714 0.144 |
## 8 -0.150 0.077 0.022 |
## 9 -0.108 0.040 0.008 |
## 10 -0.486 0.805 0.209 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## F | -0.501 7.759 0.366 -10.458 | -0.122 0.559
## M | 0.731 11.320 0.366 10.458 | 0.178 0.816
## 15-24 | -0.056 0.050 0.001 -0.643 | -0.681 9.017
## 25-34 | 0.606 4.410 0.110 5.730 | 0.433 2.729
## 35-44 | -0.091 0.057 0.001 -0.615 | -0.513 2.223
## 45-59 | -0.186 0.365 0.009 -1.621 | 0.712 6.534
## +60 | -0.572 2.163 0.047 -3.768 | 0.260 0.544
## escape-exoticism | -0.129 0.408 0.015 -2.107 | 0.307 2.838
## Not.escape-exoticism | 0.115 0.366 0.015 2.107 | -0.276 2.551
## Not.spirituality | 0.133 0.634 0.039 3.407 | 0.028 0.034
## cos2 v.test Dim.3 ctr cos2 v.test
## F 0.022 -2.546 | -0.048 0.099 0.003 -0.996 |
## M 0.022 2.546 | 0.070 0.144 0.003 0.996 |
## 15-24 0.205 -7.830 | 0.458 4.698 0.093 5.267 |
## 25-34 0.056 4.087 | 0.652 7.146 0.127 6.164 |
## 35-44 0.040 -3.477 | -0.493 2.371 0.037 -3.347 |
## 45-59 0.129 6.218 | -0.796 9.399 0.162 -6.950 |
## +60 0.010 1.713 | -0.497 2.282 0.036 -3.271 |
## escape-exoticism 0.085 5.041 | 0.270 2.523 0.066 4.429 |
## Not.escape-exoticism 0.085 -5.041 | -0.243 2.268 0.066 -4.429 |
## Not.spirituality 0.002 0.715 | -0.376 7.080 0.309 -9.619 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## sex | 0.366 0.022 0.003 |
## age_Q | 0.135 0.332 0.355 |
## escape.exoticism | 0.015 0.085 0.066 |
## spirituality | 0.039 0.002 0.309 |
## healthy | 0.174 0.059 0.082 |
## diuretic | 0.182 0.110 0.070 |
## friendliness | 0.143 0.000 0.026 |
## iron.absorption | 0.055 0.006 0.005 |
## feminine | 0.435 0.009 0.005 |
## sophisticated | 0.145 0.038 0.166 |
plot(tea_mca, invisible=c("ind"), habillage="quali",cex=0.8,ylim=c(-1,1),xlim=c(-1,1))
First of all, the dimensions in this plot (Dim1: 11.28% and Dim2: 9.28%) do not capture much of the variation in the data. As a rough interpretation, the variables that attribute “sublime” qualities to tea lie relatively close to each other - and the same holds for the opposite variables. These groups are connected to gender (i.e. F is close to the sublime attributes, M to the opposite). Dim1 could thus be interpreted as capturing the difference in how the genders perceive tea.
An interesting aspect of the plot is the position of the different age groups. The groups 35-44 and 15-24 resemble each other, but 25-34, which lies between them in age, does not resemble either. There is also a difference in how the groups 45-59 and 60+ perceive tea. It seems that Dim2 has to do with age.
I wondered how the type of tea would appear in relation to the other variables:
tea_set_type <- tea[c(13,20,23,25:36)]
tea_mca <- MCA(tea_set_type, graph=FALSE)
plot(tea_mca, invisible=c("ind"), habillage="quali",cex=0.8,ylim=c(-1.5,1.5),xlim=c(-1.5,1.5))
The biggest surprise is that green tea is not associated with the “sublime” attributes. It also seems that the age groups 25-34 and 15-24 differ from the other age groups, and Earl Grey appears to be the tea that the younger groups prefer.